Patent abstract:
The invention relates to a method comprising: clustering each semi-structured document of a plurality of semi-structured documents into a cluster of a plurality of clusters, based on meta-information of said semi-structured documents; detecting segments in the semi-structured documents of each cluster of said plurality of clusters by classification methods; detecting segment attributes of each segment by analyzing a textual content of each segment, said segment attributes comprising a plurality of possible sets of named entities with an associated confidence level; matching segments of each of the documents in each cluster based on said segment attributes of each particular segment and a probability distribution of said named entities; determining the meaning of each segment group comprising comparable segments based on NLP of the content of each of said segments; and assigning a segment identifier to each segment.
Publication number: BE1027433B1
Application number: E20195470
Filing date: 2019-07-18
Publication date: 2021-02-15
Inventors: Jež Pavel; Feu Georges De
Applicant: Lynxcare Clinical Informatics
Main IPC number:
Patent description:

Technical Field
The present invention relates to a method for extracting information from semi-structured documents, an associated system and a processing device.
Background Art
Currently, a lot of information is exchanged in the form of semi-structured text.
This is especially the case in the medical domain where many systems output messages in HL7 or XML formats with different definitions.
However, there are many definitions for each of those formats.
In addition, many institutions add their own definitions on top of the standard format to meet their needs.
This makes data exchange complex and limits interoperability between different, for example medical, institutions.
To overcome these difficulties, custom interfaces can be developed to consume messages from a particular producer.
However, such an approach is not scalable, as adding a new message source requires an update of the configuration on the receiver side, often involving some development and/or manual intervention.
The currently known systems and methods for extracting information from semi-structured documents therefore have the disadvantage that they still have to handle this complex data and offer only limited interoperability between different institutions, e.g. medical institutions.
Disclosure of the Invention
It is an object of the present invention to provide a method for extracting information from semi-structured documents, an associated system and associated processing device that overcome or reduce these said problems.
Accordingly, embodiments of the present invention pertain to a method for extracting information from semi-structured documents, said method comprising the step of retrieving a plurality of semi-structured documents from at least one semi-structured document source, characterized in that said method further comprises the steps of:
- clustering, by an unsupervised learning algorithm or semi-supervised learning algorithm, each semi-structured document of said plurality of semi-structured documents into a cluster of a plurality of clusters, based on at least one of meta-information, content and layout of each said semi-structured document; and
- detecting segments in each semi-structured document of each cluster of said plurality of clusters by means of an unsupervised classification method or semi-supervised classification method; and
- detecting segment attributes of each segment, using Natural Language Processing (NLP), by analyzing a textual content of each said segment, said segment attributes comprising a plurality of possible sets of named entities, each set having a certain confidence level; and
- matching segments of each of the documents in each cluster based on said segment attributes of each particular segment and a probability distribution of said named entities; and
- determining the meaning of each segment group comprising similar segments based on Natural Language Processing of said content of each segment in said segment group; and
- assigning a segment identifier to each segment based on the determined concept distribution of each segment.
Another relevant embodiment relates to the method of extracting information from semi-structured documents according to claim 1, characterized in that, said method further comprises the step of determining relationships between segments of documents in said cluster by applying Natural Language Processing.
Another relevant embodiment relates to a method for extracting information from a semi-structured document according to claim 1, characterized in that, said clustering step may be further based on the graphical layout of such a document.
Another embodiment of the present invention relates to a method for extracting information from a semi-structured document according to claim 1, wherein, in said step of detecting segment attributes, context is used as an input parameter for the Natural Language Processing.
Another relevant embodiment relates to a method for extracting information from a semi-structured document according to claim 1, characterized in that, the step of said analyzing a textual content of each said segment is based on a set of algorithms organized in different layers.
Another relevant object relates to a processing device for extracting information from semi-structured documents retrieved from at least one semi-structured document source, which processing device comprises a processing means (3) configured for:
- clustering, by an unsupervised learning algorithm or semi-supervised learning algorithm, each semi-structured document of said plurality of semi-structured documents into a cluster of a plurality of clusters, based on at least one of meta-information, content and layout of each said semi-structured document; and
- detecting segments in each semi-structured document of each cluster of said plurality of clusters by unsupervised classification methods or semi-supervised classification methods; and
- detecting segment attributes of each segment, using Natural Language Processing, by analyzing a textual content of each said segment, said segment attributes comprising a plurality of possible sets of named entities, each set having a certain confidence level; and
- matching segments of each of the documents in each cluster based on said segment attributes of each particular segment and a probability distribution of said named entities; and
- determining the meaning of each segment group comprising similar segments based on Natural Language Processing of said content of each of said segments in said segment group; and
- assigning a segment identifier to each segment based on the determined concept distribution of each segment.
A further relevant embodiment relates to the processing device for extracting information from semi-structured documents according to claim 6, characterized in that said processing device is further configured for:
- determining relationships between segments of documents in said cluster by applying Natural Language Processing.
Yet a further relevant embodiment relates to the processing device for extracting information from semi-structured documents according to claim 6, characterized in that said processing device is further configured for:
- basing said clustering of each semi-structured document of said plurality of semi-structured documents on the graphical layout of such a document.
Another relevant embodiment relates to a processing device for extracting information from semi-structured documents according to claim 6, characterized in that, said processing device is further configured to: detect segment features by additionally applying a context as an input parameter for the Natural Language Processing.
Yet another relevant embodiment relates to a processing device for extracting information from semi-structured documents according to claim 6, characterized in that said processing device is further configured to analyze a textual content of each said segment based on a set of algorithms organized in different layers.
Another relevant embodiment relates to a system for extracting information from semi-structured documents, said system comprising means configured to retrieve a plurality of semi-structured documents from at least one semi-structured document source, characterized in that said system further comprises a processing device according to claim 6.
Indeed, this objective is achieved as follows. When a plurality of semi-structured documents is retrieved from at least one semi-structured document source, each semi-structured document of the plurality is clustered by means of an unsupervised or semi-supervised learning algorithm, that is, assigned to a cluster (a group of documents) based on at least one of meta-information, content and layout of the document. Segments are then detected in each semi-structured document of each cluster of the plurality of clusters by means of unsupervised clustering methods such as k-means clustering or hierarchical clustering.
It is also possible to use semi-supervised clustering methods that use different sample documents, where domain experts have indicated the type, to train a model that organizes documents into clusters.
Any supervised classification algorithm (random forest, support vector machines, neural network to name a few) can be used as a starting point for a continuous, semi-supervised learning process.
The sample of the results is then checked and, if necessary, corrected by domain experts to improve model performance.
The attributes that will be used by the clustering algorithm include not only the document text, but also the layout.
The position of a text in a document can indicate a certain type of document (e.g., laboratory results with many tables are visually different from clinical letters with continuous text). Finally, it is possible to detect the type of document based on the frequency of certain keywords.
For example, a term frequency-inverse document frequency (TF-IDF) weighting (Rajaraman & Ullman, 2011) can be used to check the relevance (match) of the document to a particular document cluster, which can be defined by one or more keywords.
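The keyword-based relevance check can be illustrated with a minimal sketch. It assumes pre-tokenised documents, the plain tf × log(N/df) variant of TF-IDF, and a hypothetical keyword set defining a "laboratory results" cluster; none of these details are specified in the description.

```python
import math
from collections import Counter


def tf_idf_scores(docs):
    """Return one {term: tf-idf weight} dict per tokenised, non-empty document."""
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def cluster_relevance(doc_scores, keywords):
    """Sum the TF-IDF weights of the keywords that define a cluster."""
    return sum(doc_scores.get(keyword, 0.0) for keyword in keywords)


# Toy corpus (illustrative only).
docs = [
    "haemoglobin result table haemoglobin value".split(),
    "dear colleague patient was seen in clinic".split(),
    "patient result discussed in clinic letter".split(),
]
scores = tf_idf_scores(docs)
lab_keywords = {"haemoglobin", "value"}  # hypothetical cluster definition
relevance = [cluster_relevance(s, lab_keywords) for s in scores]
best = max(range(len(docs)), key=relevance.__getitem__)
```

Here `best` is the index of the document that matches the keyword-defined cluster most strongly; documents containing none of the keywords score zero.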
Then, segment attributes are detected for each detected segment of all documents in a cluster, by analyzing a textual content of each said detected segment, the segment attributes comprising a plurality of possible sets of named entities and each set having a certain confidence level.
This step of detecting segment features is followed by the step of matching segments from each of the documents in each cluster based on the determined segment features of each segment and a probability distribution of said named entities. This matching is intended to group segments of the same type across different documents.
For example, the segments describing the operation in the operation reports can be grouped based on correspondence between the concepts they contain. This similarity can be measured by comparing so-called segment vectors, which are calculated as a superposition of the term vectors obtained by an algorithm for determining medical text similarity, such as the UMLS2Vec algorithm.
Further, a meaning is determined for each segment group comprising similar segments, this meaning being extracted by means of Natural Language Processing of the content of each of the segments in the particular segment group. The meaning of each segment is condensed into the segment vector (the superposition of all concept vectors), which contains a numerical representation, provided by the UMLS2Vec algorithm, of all concepts detected by the NLP algorithm. By comparing this vector with the vectors assigned to segments identified by domain experts, it is possible to assign an identification tag to the segment (e.g., the institution's address or the patient's medical history). By then assigning a segment identifier, such as a label, to each segment based on the determined concept distribution of each segment, all segments of the documents in a particular cluster are identified and labeled and thus given a recognizable structure.
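The segment-vector comparison can be sketched as follows. The three-dimensional concept vectors are invented stand-ins for UMLS2Vec output, and cosine similarity is an assumed choice of comparison measure; the description only states that the vectors are compared.

```python
import math

# Hypothetical concept vectors standing in for UMLS2Vec embeddings.
CONCEPT_VECTORS = {
    "knee": [0.9, 0.1, 0.0],
    "prosthesis": [0.8, 0.2, 0.1],
    "incision": [0.7, 0.3, 0.0],
    "street": [0.0, 0.1, 0.9],
    "postcode": [0.1, 0.0, 0.8],
}


def segment_vector(concepts):
    """Superpose (sum) the vectors of all concepts detected in a segment."""
    dim = len(next(iter(CONCEPT_VECTORS.values())))
    total = [0.0] * dim
    for concept in concepts:
        for i, x in enumerate(CONCEPT_VECTORS[concept]):
            total[i] += x
    return total


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_label(concepts, labelled_vectors):
    """Return the expert label whose segment vector is most similar."""
    vec = segment_vector(concepts)
    return max(labelled_vectors, key=lambda label: cosine(vec, labelled_vectors[label]))


# Segment vectors of segments that domain experts have already labelled.
labelled_vectors = {
    "operation description": segment_vector(["knee", "incision"]),
    "institution address": segment_vector(["street", "postcode"]),
}
label = assign_label(["prosthesis", "knee"], labelled_vectors)
```

In this toy setting the unlabelled segment receives the label of the expert-labelled segment whose vector points in the most similar direction.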
It should be noted that a semi-structured document may be a surgical report, a clinical letter, a laboratory result or any other document in the patient's electronic health record, as well as a message such as a communication between the hospital and a patient in the form of tweets, comments or contributions on social media, SMS or phone call transcripts. Such documents are tracked through private storage facilities such as an SQL, NoSQL or Hadoop database that can be hosted on-site or in a remote computing center for each of the respective hospitals.
It should be noted that such clustering of the semi-structured documents of said plurality of semi-structured documents may be based on at least one of meta-information, content and layout of each said semi-structured document, where such meta-information can contain a document file type, a document name, a document date and further metadata (such as size, originator and location), the title of the document, the number of paragraphs or sections and/or word frequency.
Such an unsupervised learning algorithm can be implemented by unsupervised methods such as k-means clustering or hierarchical clustering. It is also possible to use semi-supervised methods in which various sample documents, for which domain experts have indicated the type, are used to train a model that organizes documents into clusters. Any supervised classification algorithm (random forest, support vector machines, neural network to name a few) can be used as a starting point for a continuous, semi-supervised learning process. The sample of the results is then checked and, if necessary, corrected by domain experts to improve model performance.
A segment of such a document can be, for example, a paragraph, a table, a document line in HL7, an element in XML or JSON, a row in a form, one tweet out of many with the same user or hashtag, a comment on a social network, a section in an article or any other logically separated unit of text of any origin.
The segment feature detection is the step of detecting segment features in each said segment by analyzing a textual content of each said segment, the step resulting in the extraction of named entities. Such named entities are the entities of interest in the text that indicate relevant items or topics in a segment of a document. Other segment attributes can be, for example, the title, an XML or HL7 tag, the previous and next segments and/or the length and position of the segment.
Analyzing the textual content of each said segment is performed using an NLP algorithm such as sequence tagging, which detects concepts of interest in sentences and/or segments and assigns them a label indicating the type. The sequence tagging algorithms include, but are not limited to, conditional random fields, long short-term memory (LSTM) or other recurrent neural networks, various Markov models or multinomial logistic classifiers, or a combination thereof.
A further relevant embodiment relates to the method for extracting information from semi-structured documents according to claim 1, the method further comprising the step of determining relationships between segments of documents in said cluster using NLP. In other words, the method of the present embodiment determines relationships between dissimilar segments, which determination is performed by a further Natural Language Processing step. By determining relationships between non-comparable segments, it is possible to improve the identification of segments. Non-comparable segments can be related, for example by being present in the same document or by being related to the same patient. Certain categories of documents (e.g. a surgery report) contain certain segments, e.g. pre-surgery diagnosis, surgery description, etc. If a particular segment cannot be uniquely identified, it is possible to limit the possibilities by excluding those segment labels that are already present in the particular document.
Another relevant embodiment relates to a method for extracting information from semi-structured documents according to claim 1, wherein said clustering step may be further based on the graphical layout of such a document.
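The exclusion of labels already present in a document can be sketched as below. The sketch assumes that each segment label occurs at most once per document and that candidate labels come with a confidence score; the function name and the scores are hypothetical.

```python
def resolve_label(candidates, labels_in_document):
    """Pick the most confident candidate label not yet used in the document.

    candidates: {label: confidence} for an ambiguous segment.
    labels_in_document: labels already assigned to other segments of the
    same document (assumed to occur at most once per document).
    Returns None when every candidate has already been used.
    """
    remaining = {
        label: confidence
        for label, confidence in candidates.items()
        if label not in labels_in_document
    }
    if not remaining:
        return None
    return max(remaining, key=remaining.get)


# An ambiguous segment in a surgery report (illustrative confidences).
candidates = {
    "pre-surgery diagnosis": 0.48,
    "surgery description": 0.47,
    "follow-up": 0.05,
}
already_assigned = {"pre-surgery diagnosis"}
chosen = resolve_label(candidates, already_assigned)
```

Without the document-level relationship the two leading candidates are nearly tied; excluding the label already present in the same report resolves the ambiguity.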
The determination of the cluster of a received semi-structured document can be optimized by, in addition to using the content of the document, applying the results of an analysis of the graphical layout of a document resulting in additional information for deciding to which cluster a document belongs.
Another embodiment of the present invention relates to a method for extracting information from semi-structured documents according to claim 1, including the step of detecting features using context as input parameter to the NLP algorithm.
After all, if the context of a document is already known, this context can be applied in the step of detecting segment features as an input parameter to the NLP algorithm, so that decision making can be improved and confidence levels thus increase.
A further relevant embodiment relates to a method for extracting information from a semi-structured document according to claim 1, characterized in that the step of said analyzing a textual content of each said segment is based on a set of algorithms organized in different layers. This means that the top layers use the knowledge and output of the algorithms of the bottom layers. The bottom layer can be a generic, yet language-specific, NER (Named Entity Recognition) algorithm that can recognize items such as people, numbers, dates, addresses, geographic locations, etc. The next layer can contain general medical knowledge so that it can extract entities such as doctor, patient, medication, etc. Transfer learning from the bottom layer makes training of this layer easier: for example, if the layer needs to extract a doctor, it does not need to learn to extract a person, as this is already done by the underlying layer. The layer instead specializes in discerning whether this person is a doctor or a patient.
Above the generic medical layer can be a layer specific to a particular department or medical field, for example orthopedics, neonatology, etc. This layer is trained to extract terms specific to that particular department. For example, a knee prosthesis is labeled as a "man-made object" by the generic layer, a "medical device" by the generic medical layer, and a "knee prosthesis" by the orthopedic layer. Other layers are optional and user-specific. Their job is to detect something that is specific to a particular department and not used in general. Typically, this can be a clinical trial in which a customer wants to assess the effect of a new drug or treatment that is not yet part of the general methodology of the field.
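The layering can be illustrated with a toy sketch in which each layer refines the label produced by the layer below it. Plain lookup tables stand in for the trained models, and all names and table entries (including the mention "dr. smith") are hypothetical.

```python
# Lookup tables standing in for trained models (illustrative only).
GENERIC_LEXICON = {
    "knee prosthesis": "man-made object",
    "dr. smith": "person",
}
MEDICAL_RULES = {
    ("knee prosthesis", "man-made object"): "medical device",
    ("dr. smith", "person"): "doctor",
}
ORTHOPEDIC_RULES = {
    ("knee prosthesis", "medical device"): "knee prosthesis",
}


def generic_layer(mention, _lower_label):
    """Bottom layer: generic NER, independent of any lower layer."""
    return GENERIC_LEXICON.get(mention)


def medical_layer(mention, lower_label):
    """Refine the generic label; keep it when no medical rule applies."""
    return MEDICAL_RULES.get((mention, lower_label), lower_label)


def orthopedic_layer(mention, lower_label):
    """Refine the medical label with department-specific knowledge."""
    return ORTHOPEDIC_RULES.get((mention, lower_label), lower_label)


def label_through_layers(mention, layers):
    """Run a mention bottom-up through the layers; return every layer's label."""
    labels = []
    label = None
    for layer in layers:
        label = layer(mention, label)
        labels.append(label)
    return labels


layers = [generic_layer, medical_layer, orthopedic_layer]
trace = label_through_layers("knee prosthesis", layers)
```

Each upper layer only has to decide how to specialise what the lower layer already recognised, which mirrors the transfer-learning argument in the text.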
Brief Description of the Drawings
The invention will be further elucidated with reference to the following description and the appended figures.
Figure 1 shows a system for extracting information from semi-structured documents.
Figure 2 shows a more detailed system for extracting information from semi-structured documents.
Modes for Carrying Out the Invention
The present invention will be described with reference to specific embodiments and with reference to certain drawings. However, the invention is not limited thereto, and is limited only by the claims. The drawings described are only schematic and are not limiting. In the drawings, the size of some elements may be exaggerated and not drawn to scale for illustrative purposes. The dimensions and relative dimensions do not necessarily correspond to actual reductions to practice of the invention.
Furthermore, the terms first, second, third and the like are used in the specification and in the claims to distinguish between similar elements and not necessarily to describe a consecutive or chronological order. The terms are interchangeable under appropriate circumstances and the embodiments of the invention may operate in sequences other than those described or illustrated herein.
In addition, the terms top, bottom and the like in the description and in the claims are used for descriptive purposes and not necessarily to describe relative positions. The terms so used are interchangeable under appropriate circumstances and the embodiments of the invention described herein may operate in other orientations than described or illustrated herein.
The term "comprising", used in the claims, should not be construed as being limited to the means set forth below; it does not exclude other elements or steps. It should be interpreted as specifying the presence of the listed features, integers, steps or components referenced, but does not exclude the presence or addition of one or more other features, integers, steps or components or groups thereof. The scope of the expression "a device comprising means A and B" should therefore not be limited to devices consisting only of components A and B. It means that with respect to the present invention, the only relevant components of the device are A and B.
In the following sections, with reference to the drawing in FIG. 1, an implementation of the system for analyzing/extracting information from semi-structured documents according to an embodiment of the present invention is described. In the next section, all connections between these elements are defined. Next, all relevant functional means of the system for extracting information from semi-structured documents as presented in FIG. 1 are described, followed by a description of all interconnections.
The following section describes the actual implementation of extracting information from semi-structured documents according to an embodiment of the present invention using the system.
A first essential element of the system 1 is a document receiving means 2 configured to retrieve a plurality of semi-structured documents from at least one semi-structured message or document source 8, 9, 10, where each such source can be a respective database of a first, second and third institution such as a hospital.
A second essential element is the processing means 3, which is:
- configured for clustering, by executing an unsupervised learning algorithm or semi-supervised learning algorithm, each semi-structured document of said plurality of semi-structured documents into a cluster of a plurality of clusters, based on at least one of meta-information, content and layout of each said semi-structured document; and
- configured to detect segments in each semi-structured document of each cluster of said plurality of clusters by means of unsupervised classification algorithms, such as hierarchical or k-means clustering, or semi-supervised classification algorithms that allow human feedback to improve the performance of supervised classifiers; and
- configured to detect segment attributes of each segment, using NLP, by analyzing a textual content of each said segment, said segment attributes comprising a plurality of possible sets of named entities, each set having a certain confidence level; and
- configured to match segments of each of the documents in each cluster based on said segment attributes of each particular segment and a probability distribution of said named entities. Matching is the grouping of segments of the same type across different documents.
The processing means 3 is further configured to determine the meaning of each said segment group comprising comparable segments based on Natural Language Processing of said content of each said segment in said segment group and, in addition, to assign a segment identifier to each segment based on the determined concept distribution in said segment. The meaning of the segment is condensed into the segment vector, which contains a numerical representation of all concepts detected by the NLP algorithm. By comparing this vector with the vectors assigned to segments identified by domain experts, it is possible to assign an identification tag to the segment (e.g., the institution's address or the patient's medical history).
A further essential element is a storage means, which may consist of a single database or a plurality of local or distributed databases 4, 5, 6 and 7 as shown in FIG. 1, where all document clusters are stored, in combination with or separate from all results of the processing steps of the processing means 3.
It is assumed that a number of institutions, e.g. first hospital 8, second hospital 9 and third hospital 10, forward a variety of semi-structured documents such as patient reports, clinical letters and laboratory results, but possibly also communication between the hospital and a patient in the form of tweets, social media responses or contributions, SMS or transcripts of telephone conversations, tracked through private storage facilities such as an SQL, NoSQL or Hadoop database that can be hosted on-site or in a remote computing center for any of the respective hospitals.
The plurality of semi-structured documents is received by the receiving means 2 and input to the processing means 3, which in turn initiates the clustering of each received semi-structured document by first subjecting it to an unsupervised learning algorithm such as k-means or hierarchical clustering. In another embodiment, it is possible to use any supervised classifier with expert feedback as a semi-supervised learning algorithm.
Each semi-structured document of said plurality of semi-structured documents is assigned to a particular cluster of a plurality of clusters, based on at least one of meta-information, content and layout of the document, and is stored as such in a respective database. Assuming that the document has been allocated to cluster 1, it is stored in the first database 5.
This clustering is done automatically, using one or more algorithms for so-called unsupervised learning. The document attributes, i.e. the meta information, that can be used to determine which cluster a document belongs to can include the document file type, document name, document date and other metadata of the computer file (size, originator, location ...), but also content-related attributes such as the title of the document, number of paragraphs or sections or word frequency.
Additional information for the document classification, that is, the clustering, can be the layout of the semi-structured document, because, for example, a letter has a different layout from the operation report or the laboratory results. Such a document can be converted into a black and white image that is black in the areas where there is text and white in the other case. Therefore, an additional unsupervised learning algorithm can be used to divide the document set into different clusters. This can be done by vectorizing the images directly (by assigning 1 or 0 to each pixel depending on whether it is black or white) or by using an autoencoder neural net or other machine learning algorithm that extracts the relevant latent features from the image that matches the document and assigns a vector with a different length than the number of pixels in the image to the document. The resulting feature vectors can be grouped using hierarchical or k-means clustering or other unsupervised clustering methods.
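The direct vectorisation of a page image can be sketched as follows. The sketch assumes the page has already been rendered as a small grayscale grid (0 = black, 255 = white), uses an arbitrary threshold of 128, and uses the Hamming distance as an assumed way of comparing two layout vectors before clustering.

```python
def layout_vector(page, threshold=128):
    """Flatten a grayscale page (rows of 0..255 values) into a 1/0 vector.

    A pixel becomes 1 where the page is dark (text) and 0 where it is light.
    """
    return [1 if pixel < threshold else 0 for row in page for pixel in row]


def hamming(u, v):
    """Number of positions at which two equal-length binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))


# Tiny 3x4 "pages" (illustrative): two letters with continuous text at the
# top, and a table-like page with text in separate columns.
letter = [[20, 30, 25, 40], [35, 20, 30, 25], [255, 255, 255, 255]]
other_letter = [[10, 20, 30, 250], [30, 25, 20, 15], [255, 255, 250, 255]]
table = [[20, 255, 30, 255], [255, 255, 255, 255], [25, 255, 35, 255]]

d_letters = hamming(layout_vector(letter), layout_vector(other_letter))
d_letter_table = hamming(layout_vector(letter), layout_vector(table))
```

The two letters end up much closer to each other than to the table-like page, so a clustering method applied to these vectors would tend to group them together. Real page images have far more pixels, which is one reason the text also mentions an autoencoder that produces a shorter feature vector.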
The content-based and graphical layout-based clustering output is then combined to provide a more reliable and robust result.
In some cases, the type of document is known. In this case, the category of the document is used to train supervised classifiers to distinguish between different types of documents based on their content and layout.
The separation rules between different document types learned by this clustering algorithm are saved and reused when a new batch of documents arrives from the same context.
When a batch of new documents arrives from a new context (e.g. so far the algorithm has only seen cardiology documents and now it receives oncology reports), it uses the already learned rules as a starting position and checks whether the document set can be consistently separated into clusters. The processing that determines the clustering has to adjust itself somewhat, but this phase of unsupervised adjustment is much easier than when no context was known, as the algorithm now only has to learn the difference between the two medical contexts and does not have to start all over again. In practice, the algorithm first clusters the documents based on the rules it has learned from the previous context. Only then does the next phase of unsupervised learning begin. The advantage is that learning here already starts with fairly well-defined clusters, while for the first context(s) the algorithm started with a random cluster assignment and then iteratively tried to find the best distribution of document clusters, so that documents with similar attributes are in the same cluster.
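Reusing previously learned cluster structure as a starting position can be sketched with a warm-started k-means. The sketch assumes the "learned rules" are represented by cluster centroids and uses Lloyd's algorithm; the two-dimensional feature vectors and centroid values are invented for illustration.

```python
def assign(points, centroids):
    """Assign each point to the index of its nearest centroid."""
    def nearest(p):
        return min(
            range(len(centroids)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
        )
    return [nearest(p) for p in points]


def k_means(points, centroids, max_iter=100):
    """Lloyd's algorithm started from the given centroids.

    Returns (centroids, labels, iterations used).
    """
    centroids = [list(c) for c in centroids]
    labels = []
    for iteration in range(1, max_iter + 1):
        labels = assign(points, centroids)
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centroids.append([sum(xs) / len(members) for xs in zip(*members)])
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            return centroids, labels, iteration
        centroids = new_centroids
    return centroids, labels, max_iter


# Centroids learned on an earlier context (hypothetical values).
learned_centroids = [[1.0, 1.0], [5.0, 5.0]]
# Feature vectors of documents from a new, slightly shifted context.
new_documents = [[1.5, 1.2], [1.3, 1.6], [5.4, 5.2], [5.6, 5.5]]

centroids, labels, iterations = k_means(new_documents, learned_centroids)
```

Because the starting centroids are already close to the new clusters, the first assignment is correct and the algorithm only needs to shift the centroids slightly before it stops.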
After the step of clustering the received semi-structured documents, the processing means processes each semi-structured document for each cluster to detect segments.
This processing includes detection of segments in each cluster.
A segment in this case can be, for example, a paragraph, a table, a message line in HL7 or an element in the XML or JSON.
At this stage, we can also take advantage of the use of the graphical layout analysis as already used in clustering, because the pieces of text that form a single cluster are often also visually linked in the document.
This use of graphic layout leads to a more reliable and robust result in segment detection.
The segment detection is analogous to the cluster detection - in the latter case, the set of all documents was divided into different groups that have something in common.
In the case of segment detection, the set of all lexical tokens (strings with an assigned and thus identified meaning) within a document is divided into different groups that have something in common - a section, line in the form, etc.
The segments are detected using unsupervised methods such as k-means clustering or hierarchical clustering.
It is also possible to use semi-supervised methods in which different sample documents, in which domain experts have indicated the segment boundaries, are used to train a model that detects segments in further documents.
Any supervised classification algorithm (random forest, support vector machines, neural network to name a few) can be used as a starting point for a continuous, semi-supervised learning process.
The sample of the results is then checked and, if necessary, corrected by domain experts to improve model performance.
The attributes used by the segment detection algorithm include not only the document text but also the layout.
The position of a text in a document can indicate that it belongs to a particular segment. Finally, it is possible to detect the segment boundaries (and thus segments) by means of various regular expressions that look for typical means of dividing the segments (various newlines, numbered titles, page breaks, etc.) after or parallel with the step of detecting segments, the processing means processes (each) semi-structured document for each cluster to detect segment features of each segment using NLP to analyze a textual content of each said segment. The segment characteristics include a multitude of possible sets of named entities, each set having a certain confidence level. The NLP algorithms can be organized in layers, which means that the algorithms from the top layers use the knowledge and output of the algorithms in the bottom layers. The bottom layer can be a generic, but language-specific, NER algorithm that can recognize items such as people, numbers, dates, addresses, geographic location, etc. The next layer can contain general medical knowledge, so it can extract entities such as doctor, patient, medication, etc. Transfer learning from the bottom layer makes training from this layer easier - for example, if the layer wants to extract a doctor, they don't need to learn to extract a person, as this is already done by the underlying layer. The layer, on the other hand, specializes in discerning whether this person is a doctor or patient.
Above the generic medical layer can be a layer specific to a particular department / medical field, for example orthopedics, neonatology, etc. This layer is trained to extract terms specific to that particular department. For example, a knee prosthesis is labeled a "man-made object" by the generic layer, a "medical device" by the generic medical layer, and a "knee prosthesis" by the orthopedic layer. Other layers are optional and user-specific. Their job is to detect something that is specific to a particular department and not used in general. Typically, this can be a clinical trial in which a customer wants to assess the effect of a new drug / treatment that is not yet part of the general methodology of the field.
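The layering described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the lookup tables stand in for trained NER models, and the names `GENERIC`, `MEDICAL`, `ORTHOPEDIC`, `make_refining_layer` and `label_through_layers` are hypothetical and do not come from the description.

```python
from typing import Callable, Dict, List, Optional

# A layer receives the term and the label produced by the layer below it,
# and returns a more specific label or None when it has nothing to add.
Layer = Callable[[str, Optional[str]], Optional[str]]

# Toy lookup tables standing in for trained models (assumption).
GENERIC: Dict[str, str] = {
    "knee prosthesis": "man-made object",
    "dr. smith": "person",
}
MEDICAL: Dict[str, Dict[str, str]] = {
    "man-made object": {"knee prosthesis": "medical device"},
    "person": {"dr. smith": "doctor"},
}
ORTHOPEDIC: Dict[str, Dict[str, str]] = {
    "medical device": {"knee prosthesis": "knee prosthesis"},
}


def generic_layer(term: str, below: Optional[str]) -> Optional[str]:
    """Bottom layer: ignores `below` because nothing sits underneath it."""
    return GENERIC.get(term)


def make_refining_layer(table: Dict[str, Dict[str, str]]) -> Layer:
    """Build a layer that only refines what the layer below already found."""
    def layer(term: str, below: Optional[str]) -> Optional[str]:
        if below is None:
            return None
        return table.get(below, {}).get(term)
    return layer


def label_through_layers(term: str, layers: List[Layer]) -> List[str]:
    """Return the label assigned by each layer, from bottom to top."""
    labels: List[str] = []
    current: Optional[str] = None
    for layer in layers:
        refined = layer(term.lower(), current)
        if refined is None:
            break  # higher layers have nothing to build on
        current = refined
        labels.append(current)
    return labels


LAYERS: List[Layer] = [
    generic_layer,
    make_refining_layer(MEDICAL),
    make_refining_layer(ORTHOPEDIC),
]
print(label_through_layers("Knee prosthesis", LAYERS))
```

Each upper layer only has to decide how to refine the label it receives, which mirrors the point that the medical layer does not need to relearn what a person or an object is.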
General characteristics of segments are detected in this step. Examples of such segment attributes are segment type (paragraph, table and element name in the case of XML or JSON, message type for HL7), length, position in document, etc.
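As a rough sketch of regular-expression boundary detection combined with the general attributes listed above (type, length, position), the following Python fragment may help. The boundary pattern, the two segment types and the `Segment` structure are assumptions chosen for the example, not part of the described method.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical boundary pattern: a page break (form feed), a blank line,
# or a single newline that is followed by a numbered title such as "2. Title".
BOUNDARY = re.compile(r"\f|\n{2,}|\n(?=\d+\.\s+\S)")
NUMBERED = re.compile(r"\d+\.\s+\S")


@dataclass
class Segment:
    text: str
    segment_type: str  # "numbered section" or "paragraph" in this toy example
    length: int        # number of characters in the segment text
    position: float    # relative start offset in the document, 0.0 to 1.0


def split_segments(document: str) -> List[Segment]:
    """Split a document at boundary matches and attach general attributes."""
    segments: List[Segment] = []
    cursor = 0
    # The trailing None handles the text after the last boundary.
    for match in list(BOUNDARY.finditer(document)) + [None]:
        end = match.start() if match else len(document)
        raw = document[cursor:end]
        text = raw.strip()
        if text:
            kind = "numbered section" if NUMBERED.match(text) else "paragraph"
            start = cursor + (len(raw) - len(raw.lstrip()))
            segments.append(Segment(text, kind, len(text), start / len(document)))
        cursor = match.end() if match else end
    return segments


sample = (
    "Referral letter\n"
    "1. History: patient fell on the left knee.\n\n"
    "2. Procedure: total knee replacement.\f"
    "Follow-up in six weeks."
)
for segment in split_segments(sample):
    print(segment.segment_type, segment.length, round(segment.position, 2))
```

For XML, JSON or HL7 input the element name or message type would replace the toy segment types used here.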
The textual content of a segment is also analyzed and named entities are extracted using the NLP algorithm described previously. It returns the list of all possible named entities along with the confidence level of the algorithm decision. For example, the result of this processing phase is that the particular section 10 contains problems with high confidence, some procedures with medium confidence, and some lexical tokens that can be both equipment and a drug depending on the context. In other words, the results are different sets of entity types with different confidence levels.
If the context is known, this context can be provided as an input parameter to the NLP algorithm to improve its decision making and confidence levels. On the other hand, if the context is unknown, the algorithm returns several most likely interpretations within the contexts known to the processing system. To check the match with some known contexts, a medical text matching algorithm, such as the UMLS2Vec algorithm, is applied to calculate the distance of the CUIs detected in the segments from those CUIs normally expected for a given context.
With the UMLS2Vec algorithm, a vector can be assigned to any detected concept in a particular segment. Superposition of those vectors returns a segment vector. If there are more possible interpretations of a segment within known contexts, a vector is constructed for each context hypothesis. This vector can be compared to a distribution of the segment vectors for a particular segment in the assumed context.
If the detected segment vector is statistically compatible with that distribution, we can assume that the segment is from the assumed context.
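A minimal Python sketch of this check follows. The description does not say which statistical test is used, so the per-dimension z-score threshold below is an assumption, as are the function names and the toy two-dimensional vectors.

```python
from statistics import mean, stdev
from typing import List, Sequence

Vector = List[float]


def superpose(vectors: Sequence[Vector]) -> Vector:
    """Segment vector as the component-wise sum of its concept vectors."""
    return [sum(components) for components in zip(*vectors)]


def is_compatible(segment_vec: Vector,
                  reference_vecs: Sequence[Vector],
                  max_z: float = 3.0) -> bool:
    """Assumed test: every component lies within max_z standard deviations
    of the reference segment vectors observed for the assumed context."""
    for i, value in enumerate(segment_vec):
        column = [ref[i] for ref in reference_vecs]
        mu, sigma = mean(column), stdev(column)
        if sigma == 0:
            if value != mu:
                return False
            continue
        if abs(value - mu) / sigma > max_z:
            return False
    return True


# Toy concept vectors for two detected concepts (hypothetical values).
concept_vectors = [[1.0, 0.0], [0.5, 0.5]]
segment_vector = superpose(concept_vectors)

# Segment vectors previously seen for this segment type in the assumed context.
reference = [[1.4, 0.6], [1.6, 0.4], [1.5, 0.5], [1.3, 0.7]]
print(segment_vector, is_compatible(segment_vector, reference))
```

A multivariate test that accounts for correlations between dimensions could replace the per-dimension check without changing the overall flow.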
If no known context is compatible with any interpretation of the segment, all possible interpretations of all concepts are saved and disambiguation is done at the end of the document processing.
Disambiguation, then, means finding the most likely interpretation of all combinations of concept interpretations. The most likely interpretation, for vectors assigned through the UMLS2Vec algorithm, is the one with the least variance. This means it contains terms that are related. Higher variance means that the relationship between concepts is smaller or nonexistent.
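The least-variance selection can be sketched in Python as follows. The sketch assumes that "variance" means the mean squared distance of the concept vectors from their centroid and that every combination can be enumerated exhaustively; the tokens, interpretations and vectors are invented toy data.

```python
from itertools import product
from typing import Dict, List

Vector = List[float]


def spread(vectors: List[Vector]) -> float:
    """Mean squared distance of the vectors from their centroid."""
    n = len(vectors)
    centroid = [sum(components) / n for components in zip(*vectors)]
    return sum(
        sum((x - m) ** 2 for x, m in zip(vec, centroid)) for vec in vectors
    ) / n


def disambiguate(candidates: Dict[str, Dict[str, Vector]]) -> Dict[str, str]:
    """Pick one interpretation per token so that the chosen vectors are
    as tightly grouped as possible."""
    tokens = list(candidates)
    best_value = None
    best_choice: Dict[str, str] = {}
    for combo in product(*(candidates[token].items() for token in tokens)):
        value = spread([vec for _, vec in combo])
        if best_value is None or value < best_value:
            best_value = value
            best_choice = {token: name for token, (name, _) in zip(tokens, combo)}
    return best_choice


# Toy data: each token maps interpretation name -> hypothetical concept vector.
candidates = {
    "pump": {"equipment": [1.0, 0.0], "drug": [0.0, 1.0]},
    "catheter": {"equipment": [0.9, 0.1]},
    "infusion": {"procedure": [0.8, 0.2], "drug": [0.1, 0.9]},
}
print(disambiguate(candidates))
```

Exhaustive enumeration grows exponentially with the number of ambiguous tokens, so a real system would likely need pruning or a greedy search.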
After or in parallel with the previous steps, the processing means 3 matches segments of each of the documents, i.e. determines similar segments and groups them, in each cluster based on said segment characteristics of each particular segment and a probability distribution of said named entities. For example, segments with a description of the operation in operation reports can be grouped based on the correspondence between the concepts they contain. This match can be measured by comparing so-called segment vectors that are calculated as a superposition of the term vectors obtained by a medical text matching algorithm such as the UMLS2Vec algorithm.
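One simple way to picture this grouping is a greedy pass over the segment vectors using cosine similarity, as in the Python sketch below. The greedy strategy, the threshold of 0.9 and the segment identifiers are assumptions for illustration; the description does not prescribe a particular grouping algorithm.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_segments(segment_vectors: Dict[str, Vector],
                   threshold: float = 0.9) -> List[List[str]]:
    """Put each segment into the first group whose representative vector is
    similar enough, otherwise start a new group."""
    groups: List[Tuple[Vector, List[str]]] = []
    for segment_id, vec in segment_vectors.items():
        for representative, members in groups:
            if cosine(representative, vec) >= threshold:
                members.append(segment_id)
                break
        else:
            groups.append((vec, [segment_id]))
    return [members for _, members in groups]


# Hypothetical segment vectors from two operation reports.
vectors = {
    "report1/operation": [0.9, 0.1, 0.0],
    "report2/operation": [0.8, 0.2, 0.1],
    "report1/medication": [0.0, 0.1, 0.9],
    "report2/medication": [0.1, 0.0, 1.0],
}
print(group_segments(vectors))
```

The first member of a group acts as its representative here; averaging the member vectors would be a natural refinement.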
After or in parallel with the previous steps, the processing means 3 determines the meaning of each said segment group, where such a segment group is a group of segments comprising similar segments. The interpretation of such a segment group is determined by applying Natural Language Processing to said content of each of said segments in said segment group.
Subsequently, by assigning a segment identifier, such as a label, to each segment based on the determined concept distribution of each segment, all segments of the documents in a particular cluster are identified and / or labeled and thus given a recognizable structure.
The meaning of the segment is condensed into the segment vector, which contains a numerical representation of all concepts detected by the NLP algorithm. By selecting, among the segment identifiers assigned by domain experts, the one whose expert-labeled segment vector makes the smallest angle with the vector of the segment under study, we can determine the segment type and thus derive the label.
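The smallest-angle selection can be sketched in a few lines of Python. The labels and vectors are invented toy data, and treating a zero-length vector as orthogonal to everything is an assumption made only to keep the function total.

```python
import math
from typing import Dict, List

Vector = List[float]


def angle(a: Vector, b: Vector) -> float:
    """Angle in radians between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if not norm_a or not norm_b:
        return math.pi / 2  # assumption: a zero vector matches nothing
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / (norm_a * norm_b))))


def assign_label(segment_vec: Vector, expert_labeled: Dict[str, Vector]) -> str:
    """Return the expert label whose vector makes the smallest angle."""
    return min(expert_labeled, key=lambda label: angle(segment_vec, expert_labeled[label]))


# Hypothetical expert-labeled segment vectors.
expert_labeled = {
    "operation description": [0.9, 0.1, 0.0],
    "medication list": [0.0, 0.1, 0.9],
}
print(assign_label([0.7, 0.3, 0.1], expert_labeled))
```

Using the angle rather than the Euclidean distance makes the comparison insensitive to segment length, since longer segments superpose more concept vectors.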
The concepts in the medical domain can be described using UMLS, which assigns a unique code (CUI - concept unique identifier) to each concept. In addition, UMLS also contains different types of relationships between the concepts (e.g. parent-child, broader relationship, close relationship, etc.). We can use these relationships as constraints that allow us to define an N-dimensional space in which each CUI corresponds to an N-dimensional vector, so that it becomes possible to assess how close different concepts are to each other even when they are not directly related.
For example, if Concept A is a parent of Concept B and Concept B is a parent of Concept C, there is also a relationship between A and C, even though they are not directly related. This relationship is naturally weaker, as their distance is greater than that between A and B or B and C respectively: two steps have to be taken instead of one.
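The number of steps between two concepts can be counted with a breadth-first search over the direct relationships, as in this small Python sketch. The function name and the tuple representation of relationships are assumptions, and relationship types are ignored here.

```python
from collections import deque
from typing import Dict, Iterable, Optional, Set, Tuple


def steps_between(relations: Iterable[Tuple[str, str]],
                  start: str, goal: str) -> Optional[int]:
    """Smallest number of direct relationships linking start to goal,
    or None when the two concepts are not related at all."""
    neighbours: Dict[str, Set[str]] = {}
    for a, b in relations:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    return None


relations = [("A", "B"), ("B", "C")]  # A is a parent of B, B is a parent of C
print(steps_between(relations, "A", "B"), steps_between(relations, "A", "C"))
```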
Formally, finding vectors corresponding to all concepts would be equivalent to solving a system (or multiple systems) of linear equations.
To describe the way the concept is vectorized, let's first introduce some definitions:
* Two UMLS concepts A and B are directly related if a bilateral relationship of any type is defined that includes both concepts A and B. For example, A is a parent of B; A is caused by B.
* Two UMLS concepts A and B are related if an ordered, finite list of distinct concepts A1, A2, A3, …, An exists such that each consecutive pair (A and A1, A1 and A2, …, An and B) is directly related. The relationships along this chain can be of any type. For example, A cures A1, A1 is a parent of A2, etc., and B is finally an expression of An.
Because of this definition, all directly related CUIs are also related. However, the reverse is not true.
The UMLS does not guarantee that all concepts are related, nor that at least one bilateral relationship is defined for each concept. Therefore, the first step in the vectorization is the identification of the largest connected sets, constructed in the following way:
1) A CUI is randomly chosen. This CUI is the first member of the set. Then, using UMLS, all directly related CUIs are found and added to the set.
2) For each CUI found in the previous step, all directly related CUIs are found and added to the set unless they are already there.
3) Step 2 is repeated until no new CUIs can be added.
Then the remaining CUIs are used to assemble the next set using steps 1-3 above. After M steps we obtain M sets with one or more CUIs. In any set that has more than one CUI, each CUI is related to at least one other CUI in the same set. At the same time, it is not related to a CUI from another set.
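Steps 1-3 amount to finding the connected components of the graph of direct relationships. A minimal Python sketch is given below; it assumes that every relationship only mentions CUIs from the given list, and it picks the seed deterministically (alphabetically smallest) instead of randomly so that the output is reproducible. The short identifiers are placeholders, not real CUIs.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def connected_sets(cuis: Iterable[str],
                   direct_relations: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Partition the CUIs into maximal sets of mutually related concepts."""
    neighbours: Dict[str, Set[str]] = {cui: set() for cui in cuis}
    for a, b in direct_relations:
        neighbours[a].add(b)
        neighbours[b].add(a)

    remaining = set(neighbours)
    sets: List[Set[str]] = []
    while remaining:
        seed = min(remaining)          # step 1 (deterministic instead of random)
        remaining.remove(seed)
        current = {seed}
        queue = deque([seed])
        while queue:                   # steps 2 and 3
            cui = queue.popleft()
            for other in neighbours[cui]:
                if other in remaining:
                    remaining.remove(other)
                    current.add(other)
                    queue.append(other)
        sets.append(current)
    return sets


cuis = ["A", "B", "C", "D", "E", "F"]
relations = [("A", "B"), ("B", "C"), ("D", "E")]
print(connected_sets(cuis, relations))
```

The concept F has no bilateral relationship at all and therefore ends up in a set of its own, which matches the remark that a set may contain a single CUI.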
The second step is to assign each pair of related CUIs a real number that corresponds to the strength of the relationship. A large number indicates a very strong relationship, while zero is used for all CUI pairs that are not related in the UMLS.
The relationship strength can be determined from the shortest path (smallest number of intermediaries) between the related CUIs. Let's assume there are K relationship types defined in the UMLS. Then we can define two parameters $\alpha_k$ and $\beta_k$ for each relationship type $k$, with values from the interval (-1, 1).
The relationship strength can then be calculated using the formula:
$$r = \sum_k \left( \alpha_k + \beta_k^{N_k} \right)$$
where $k$ runs over all relationship types defined along the shortest path and $N_k$ is the distance from the furthest intermediary connected through a relationship of type $k$. With this construction we obtain a real number for every pair of CUIs that are related. This can be represented in a matrix $R$ of dimensions $N_{C_i} \times N_{C_i}$, where $N_{C_i}$ is the number of CUIs in a previously constructed connected set $i$. We can now use matrix factorization to obtain a matrix $Y$ such that
$$R = Y^{T} \times Y$$
For the matrix multiplication to be valid, the matrix $Y$ must have $N_{C_i}$ columns and any number of rows $f$, so that each column of matrix $Y$ is an $f$-dimensional vector that corresponds to a particular CUI. There are many proven techniques for matrix factorization, for example the alternating least squares method.
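To make the role of the factor matrix concrete, the Python sketch below factorizes a small toy strength matrix and checks that the dot product of two columns reproduces the corresponding entry. Note the swapped-in technique: instead of alternating least squares with a freely chosen number of rows, it uses a Cholesky decomposition, which only works when the matrix is symmetric positive definite and which fixes the number of rows to the number of CUIs. The matrix values are invented.

```python
import math
from typing import List

Matrix = List[List[float]]


def factorize(R: Matrix) -> Matrix:
    """Return Y with Y^T Y = R, using a Cholesky decomposition R = L L^T
    and Y = L^T. Requires R to be symmetric positive definite."""
    n = len(R)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                diagonal = R[i][i] - partial
                if diagonal <= 0:
                    raise ValueError("R is not positive definite")
                L[i][j] = math.sqrt(diagonal)
            else:
                L[i][j] = (R[i][j] - partial) / L[j][j]
    return [[L[j][i] for j in range(n)] for i in range(n)]  # transpose of L


def column(Y: Matrix, j: int) -> List[float]:
    """The vector assigned to the j-th CUI."""
    return [row[j] for row in Y]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Toy relationship strengths between three CUIs of one connected set.
R = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.5],
    [0.2, 0.5, 1.0],
]
Y = factorize(R)
print(round(dot(column(Y, 0), column(Y, 2)), 6))  # reproduces R[0][2]
```

An approximate low-rank method such as alternating least squares would let the number of rows be much smaller than the number of CUIs, at the cost of reproducing the strengths only approximately.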
This factorization is done for each set of related CUIs. The numbers $f$, $\alpha_k$ and $\beta_k$ are free parameters and their values are tuned for the optimal performance of the factorization process.
The outcome of this process is that we can assign an f-dimensional vector to each CUI and thus determine the similarity of any pair of CUIs from the set of related CUIs, even for those that are not directly related in the UMLS. We can also calculate a center of mass for each group of CUIs from one set of related CUIs and thus compare groups of CUIs.
Claims (11):
[1]
A computer-controlled method for extracting information from semi-structured electronic documents, said method comprising the steps of:
- retrieving, under the control of a computer-controlled processing unit, a plurality of semi-structured electronic documents from at least one semi-structured electronic document source,
characterized in that said method further comprises the steps of:
- clustering, under the control of a computer-controlled processing unit, by a non-supervised learning algorithm or semi-supervised learning algorithm, of each semi-structured electronic document of said plurality of semi-structured electronic documents based on at least one of i) meta-information, ii) content and iii) layout of each said semi-structured electronic document in a cluster of a plurality of clusters; and
- detecting, under the control of a computer-controlled processing unit, segments in the semi-structured electronic document for each cluster of said plurality of clusters by means of unsupervised classification methods or semi-supervised classification methods; and
- detecting, under the control of a computerized processor, segment attributes of each segment, using Natural Language Processing, by analyzing a textual content of each said segment, said segment attributes comprising a plurality of possible sets of named entities, where each set has a certain confidence level; and
- matching, under the control of a computer-controlled processor, segments of each of the electronic documents in each cluster based on said segment characteristics of each particular segment and a probability distribution of said named entities in a segment group; and
- determining, under the control of a computer-controlled processor, the meaning of each said segment group comprising similar segments based on a Natural Language Processing of said content of each said segment in said segment group; and
- assigning, under the control of a computer-controlled processor, a segment identifier to each segment based on the determined concept distribution of each segment.
[2]
A computer-controlled method for extracting information from semi-structured electronic documents according to claim 1, characterized in that said method further comprises the step of: - under the control of a computer-controlled processing unit, determining relationships between segments of documents in said cluster by applying Natural Language Processing.
[3]
A computer-controlled method for extracting information from semi-structured electronic document according to claim 1, characterized in that, said clustering step may be further based on the graphical layout of such an electronic document.
[4]
Computer-controlled method for extracting information from semi-structured electronic document according to claim 1, characterized in that, in said feature detecting step, the context is used as input parameter for the Natural Language Processing.
[5]
A computer controlled method for extracting information from semi-structured electronic document according to claim 1, characterized in that, the step of said analyzing a textual content of each said segment is based on a set of algorithms organized in different layers.
[6]
Computer-controlled processing device for extracting information from semi-structured electronic documents retrieved from at least one semi-structured document source, said computer-controlled processing device comprising a processing means (3) configured for:
- clustering, by a non-supervised learning algorithm or semi-supervised learning algorithm, of each semi-structured electronic document of said plurality of semi-structured electronic documents based on at least one of meta-information, content and layout of each said semi-structured electronic document in a cluster of a plurality of clusters; and
- detecting segments in the semi-structured electronic document for each cluster of said plurality of clusters by unsupervised classification methods or semi-supervised classification methods; and
- detecting segment features of each segment, using Natural Language Processing, by analyzing a textual content of each said segment, said segment features comprising a plurality of possible sets of named entities, each set having a certain confidence level; and
- matching segments of each of the electronic documents in each cluster based on said segment characteristics of each particular segment and a probability distribution of said named entities; and
- determining the meaning of each said segment group comprising similar segments based on a Natural Language Processing of said content of each of said segments in said segment group; and
- assigning a segment identifier to each segment based on the determined concept distribution of each segment.
[7]
A computer-controlled processing device for extracting information from semi-structured electronic documents according to claim 6, characterized in that said processing device is further configured for:
- determining relationships between segments of electronic documents in said cluster by applying Natural Language Processing.
[8]
Computer-controlled processing device for extracting information from semi-structured electronic documents according to claim 6, characterized in that said processing device is further configured to: - base said clustering of each semi-structured electronic document of said plurality of semi-structured electronic documents on the graphical layout of such electronic document.
[9]
Computer-controlled processing device for extracting information from semi-structured electronic documents according to claim 6, characterized in that said processing device is further configured to: - detect segment features by additionally applying a context as an input parameter for the Natural Language Processing.
[10]
Computer-controlled processing device for extracting information from semi-structured electronic documents according to claim 6, characterized in that said processing device is further configured to: - analyze a textual content of each said segment based on a set of algorithms organized in different layers.
[11]
A computer-controlled system for extracting information from semi-structured electronic documents, said system comprising means configured to: retrieve a plurality of semi-structured electronic documents from at least one semi-structured electronic document source, characterized in that said computer-controlled system further comprises a processing device according to claim 6.
Patent family:
Publication number | Publication date
WO2021009375A1|2021-01-21|
BE1027433A9|2021-02-25|
BE1027433A1|2021-02-09|
BE1027433B9|2021-03-01|
Cited documents:
Publication number | Filing date | Publication date | Applicant | Title
US20060026203A1|2002-10-24|2006-02-02|Agency For Science, Technology And Research|Method and system for discovering knowledge from text documents|
Legal status:
2021-04-19| FG| Patent granted|Effective date: 20210215 |
Priority:
Application number | Filing date | Title
BE20195470A|BE1027433B9|2019-07-18|2019-07-18|A method of extracting information from semi-structured documents, an associated system and a processing device|
PCT/EP2020/070367| WO2021009375A1|2019-07-18|2020-07-17|A method for extracting information from semi-structured documents, a related system and a processing device|